Consolidate OpenAI server: unify implementations, add multi-protocol support#243

Merged
bernardladenthin merged 23 commits into main from claude/clever-clarke-6a6vvg on Jun 20, 2026

@bernardladenthin

Owner

Summary

This PR consolidates the two independent OpenAI-compatible server implementations (OpenAiCompatServer from PR #240 and LlamaServer from #242) into a single unified codebase. The winning implementation is OpenAiCompatServer (JDK com.sun.net.httpserver, streaming SSE, zero extra dependencies), extended with:

  • Multi-protocol API surfaces: Ollama native (/api/chat, /api/generate, /api/tags, /api/version), Anthropic Messages (POST /v1/messages), OpenAI Responses (POST /v1/responses), and llama.cpp-native (POST /infill for fill-in-the-middle)
  • Additional OpenAI routes: POST /v1/completions (text), POST /v1/embeddings, POST /v1/rerank, GET /health (liveness probe)
  • CLI consolidation: OpenAiServerCli replaces LlamaServerArgs, with support for --embedding, --reranking, --mmproj flags
  • Streaming enhancements: stream_options.include_usage passthrough, cached_tokens safety net, CORS/OPTIONS preflight
  • Structured outputs: response_format support for JSON schema validation
  • Stateless protocol translators: Pure JSON-to-JSON mappers for Ollama, Anthropic, and Responses APIs (unit-testable, no model dependency)

The NanoHTTPD dependency is removed; the server now runs on the JDK's built-in HTTP server with zero extra runtime dependencies.

Test plan

  • Affected unit tests pass locally: OpenAiServerCliTest, OllamaApiSupportTest, AnthropicApiSupportTest, ResponsesApiSupportTest, OaiRerankSupportTest, OpenAiSseFormatterTest, OpenAiRequestMapperTest
  • Integration tests pass: OpenAiCompatServerHttpTest, OpenAiCompatServerIntegrationTest, OpenAiServerCompletionIntegrationTest, OpenAiServerEmbeddingsIntegrationTest, OpenAiServerRerankIntegrationTest
  • CI is green on this branch
  • Docs updated: README.md, TODO.md, docs/feature-investigation-ide-agent-backend.md, package-info.java

Related issues / PRs

Closes the "TWO implementations to consolidate" item in TODO.md. Consolidates PR #240 (OpenAiCompatServer) and #242 (LlamaServer / NanoHTTPD) into a single unified server.

Checklist

  • I have read CONTRIBUTING.md and CODE_OF_CONDUCT.md
  • My commits follow Conventional Commits
  • No security-sensitive changes

https://claude.ai/code/session_01JdLpWD8nedY7LwNnHefZLF

bernardladenthin pushed a commit that referenced this pull request Jun 19, 2026
OpenAiServerEmbeddingsIntegrationTest loaded CodeLlama-7B with enableEmbedding() only,
which leaves the pooling type NONE (CodeLlama's GGUF reports pooling = -1). The OpenAI
/v1/embeddings path (LlamaModel.handleEmbeddings with oaicompat=true) rejects pooling
NONE, so both test methods received HTTP 500 instead of 200 (Java Tests Ubuntu job on
PR #243).

Set .setPoolingType(PoolingType.MEAN) so CodeLlama produces a single pooled sentence
vector the OAI endpoint can return (MEAN/LAST both work for decoder-only models, per
LlamaEmbeddingsTest). The low-level LlamaModel#embed path is unaffected.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JdLpWD8nedY7LwNnHefZLF
claude added 17 commits June 20, 2026 09:09
…p NanoHTTPD

Two interim OpenAI-compatible servers coexisted in net.ladenthin.llama.server
(PR #240's JDK com.sun.net.httpserver streaming server on top of #242's NanoHTTPD
blocking server). Settle on one: keep the JDK + SSE-streaming core, absorb the
NanoHTTPD server's extra routes / CLI / fat-jar entry point, then delete it.

Survivor: OpenAiCompatServer (dependency-free, embeddable, fat-jar Main-Class).
- Streaming chat via SSE with delta.tool_calls + prefill heartbeats (unchanged).
- Ported routes: POST /v1/completions, POST /v1/embeddings, GET /health.
- Broadened the model-free test seam ChatBackend -> OpenAiBackend (+ completions,
  embeddings); LlamaModelChatBackend -> LlamaModelBackend forwards the two new
  routes to handleCompletionsOai / handleEmbeddings.
- New testable CLI parser OpenAiServerCli (short/long/alias flags, --help,
  validation) replacing the inline arg map and the deleted LlamaServerArgs;
  produces ModelParameters + OpenAiServerConfig.

Deleted NanoHTTPD impl: LlamaServer, LlamaServerArgs, LlamaServerConfig,
OaiHttpServer, OaiRouter, OaiBackend, OaiResponse, LlamaModelOaiBackend
(+ OaiRouterTest, LlamaServerArgsTest, OaiHttpServerIntegrationTest).

Reconciliation:
- pom.xml: drop org.nanohttpd dependency + version; assembly Main-Class ->
  OpenAiCompatServer.
- spotbugs-exclude.xml: retarget CC_CYCLOMATIC_COMPLEXITY to OpenAiServerCli.parse;
  drop the LlamaModelOaiBackend EI_EXPOSE_REP2 entry (survivor is package-private,
  like the old LlamaModelChatBackend, which needed none).
- LlamaArchitectureTest Server layer + com.sun.net.httpserver exception and
  module-info `requires jdk.httpserver` unchanged (still correct for the survivor).
- LlamaModel javadoc link, README, CLAUDE.md, TODO.md, publish.yml comment updated;
  removed the consolidation block and the now-moot "implement SSE" TODO (its premise
  that com.sun.net.httpserver is ArchUnit-banned was wrong: it is the supported,
  exported jdk.httpserver module).

C++ (jllama.cpp / json_helpers.hpp / wrap_stream_chunk + its tests) unchanged: the
streaming path survives.

Verification (model-free): mvn compile test-compile; targeted tests
(LlamaArchitectureTest, OpenAiRequestMapperTest, OpenAiSseFormatterTest,
ChatStreamChunkParserTest, OpenAiCompatServerHttpTest, OpenAiServerCliTest) all
green; javadoc:jar clean; spotless:check clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JdLpWD8nedY7LwNnHefZLF
…ends

Implements the XS+S recommendations from the IDE/agent backend investigation,
targeting agentic tool-calling (Qwen) and local autocomplete:

XS:
- POST /infill route (FIM autocomplete: llama.vscode/Twinny/Tabby/Continue) —
  forwards verbatim to the existing native handleInfill; FIM tokens applied
  server-side from GGUF metadata. New OpenAiBackend.infill + LlamaModelBackend.
- Tolerant routing: every route also reachable without the /v1 prefix.
- cache_prompt defaulted true in the chat mapper (KV-prefix reuse for IDE latency).
- C++ regression guard (#20198): assert tool_calls.function.arguments is a JSON
  STRING, not an object — passes against pinned b9682, so agentic tool-calling is
  wire-correct for the OpenAI SDK / Roo Code / Copilot agent.
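
A minimal sketch of the tolerant-routing idea above, assuming one way to get the effect is to canonicalize the request path before dispatch. `RouteAliases` and `canonical` are hypothetical names for illustration; the server may equally register each route twice.

```java
/** Hypothetical helper: maps "/v1/foo" and "/foo" to the same route key. */
final class RouteAliases {
    private RouteAliases() {}

    /** Strips a leading "/v1" path segment so both spellings share one handler key. */
    static String canonical(String path) {
        if (path.equals("/v1")) {
            return "/";
        }
        if (path.startsWith("/v1/")) {
            return path.substring(3); // keep the slash that follows "/v1"
        }
        return path; // already prefix-less, e.g. "/infill" or "/health"
    }
}
```

Only a whole `/v1` segment is stripped, so a path such as `/v1beta/x` is left alone.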

S:
- stream_options.include_usage passthrough: OpenAiRequestMapper forwards the
  stream_options object verbatim (new InferenceParameters.withStreamOptions) so
  the native server emits the trailing usage chunk OpenAI clients expect.
- cached_tokens safety net: OpenAiSseFormatter.ensureUsageCachedTokens guarantees
  usage.prompt_tokens_details.cached_tokens is present on the streamed usage chunk,
  fixing the documented Copilot custom-endpoint crash (microsoft/vscode #273482)
  regardless of upstream. Applied in the SSE path; token-delta chunks pass through
  unparsed.
- CORS: a com.sun.net.httpserver Filter answers OPTIONS preflights with 204 +
  Access-Control-Allow-{Origin,Methods,Headers} and stamps Allow-Origin on every
  response. New OpenAiServerConfig.corsAllowOrigin (default "*").
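
The CORS behaviour described above could look roughly like this on the JDK HTTP server. This is a sketch, not the PR's filter: the class name `CorsFilterSketch` and the exact Allow-Methods / Allow-Headers values are assumptions.

```java
import com.sun.net.httpserver.Filter;
import com.sun.net.httpserver.Headers;
import com.sun.net.httpserver.HttpExchange;
import java.io.IOException;

/** Illustrative CORS filter; name and header values are assumptions. */
final class CorsFilterSketch extends Filter {
    private final String allowOrigin;

    CorsFilterSketch(String allowOrigin) {
        this.allowOrigin = allowOrigin;
    }

    @Override
    public void doFilter(HttpExchange exchange, Chain chain) throws IOException {
        Headers headers = exchange.getResponseHeaders();
        // Stamp Allow-Origin on every response, preflight or not.
        headers.set("Access-Control-Allow-Origin", allowOrigin);
        if ("OPTIONS".equalsIgnoreCase(exchange.getRequestMethod())) {
            headers.set("Access-Control-Allow-Methods", "GET, POST, OPTIONS");
            headers.set("Access-Control-Allow-Headers", "Authorization, Content-Type");
            exchange.sendResponseHeaders(204, -1); // -1 means "no response body"
            exchange.close();
            return; // preflight answered; the route handler never runs
        }
        chain.doFilter(exchange);
    }

    @Override
    public String description() {
        return "CORS preflight sketch";
    }
}
```

A filter like this is attached per context, for example `server.createContext("/v1/chat/completions", handler).getFilters().add(new CorsFilterSketch("*"))`.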

Tests: +infill/alias/CORS HTTP tests, +stream_options mapper test, +5
ensureUsageCachedTokens unit tests, +1 C++ arguments-as-string guard. Full server
+ json + arch suite green (77 model-free tests); C++ tool-call/stream suite green.
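
For illustration, the cached_tokens safety net covered by the ensureUsageCachedTokens tests can be sketched with the JSON chunk modelled as nested mutable maps. That modelling, the class name, and the default value 0 are assumptions; the source only states that the field is guaranteed to be present on the usage chunk.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch only: a streamed chunk is modelled as nested mutable maps. */
final class UsageCachedTokensSketch {
    private UsageCachedTokensSketch() {}

    /** Ensures usage.prompt_tokens_details.cached_tokens exists (assumed default: 0). */
    @SuppressWarnings("unchecked")
    static Map<String, Object> ensureCachedTokens(Map<String, Object> chunk) {
        Object usage = chunk.get("usage");
        if (!(usage instanceof Map)) {
            return chunk; // token-delta chunk without usage: pass through untouched
        }
        Map<String, Object> usageMap = (Map<String, Object>) usage;
        Object details = usageMap.get("prompt_tokens_details");
        Map<String, Object> detailsMap;
        if (details instanceof Map) {
            detailsMap = (Map<String, Object>) details;
        } else {
            detailsMap = new LinkedHashMap<>();
            usageMap.put("prompt_tokens_details", detailsMap);
        }
        detailsMap.putIfAbsent("cached_tokens", 0); // never overwrite an upstream value
        return chunk;
    }
}
```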

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JdLpWD8nedY7LwNnHefZLF
… docs

Continues the IDE/agent backend work (Medium items + documentation):

- POST /v1/rerank (+ /rerank, /reranking): RAG document reranking. Native
  handleRerank (made public, consistent with the other handle* methods) returns
  {document,index,score}; OaiRerankSupport reshapes it into the OpenAI rerank
  response with sorted {index, relevance_score}, top_n, and a `data` alias of
  `results` (Continue #6478). New OpenAiBackend.rerank + LlamaModelBackend.rerank.
- response_format passthrough (json_object / json_schema) for OpenAI structured
  outputs (new InferenceParameters.withResponseFormat; mapper forwards verbatim).
- Vision: --mmproj CLI flag (image_url content parts already pass through verbatim).
- CLI: --reranking (enableReranking), --mmproj (setMmproj) on OpenAiServerCli.
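
A rough sketch of the rerank reshaping step described above: native per-document scores become `{index, relevance_score}` entries, sorted and capped by `top_n`. The class and method names are hypothetical, and descending order by score is an assumption (the source says only "sorted").

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Sketch only: reshapes native rerank scores into sorted, capped results. */
final class RerankReshapeSketch {
    private RerankReshapeSketch() {}

    static final class Result {
        final int index;
        final double relevanceScore;

        Result(int index, double relevanceScore) {
            this.index = index;
            this.relevanceScore = relevanceScore;
        }
    }

    /** Sorts by descending score; topN <= 0 is treated as "keep everything". */
    static List<Result> reshape(double[] scoresByIndex, int topN) {
        List<Result> results = new ArrayList<>();
        for (int i = 0; i < scoresByIndex.length; i++) {
            results.add(new Result(i, scoresByIndex[i]));
        }
        results.sort(Comparator.comparingDouble((Result r) -> r.relevanceScore).reversed());
        if (topN > 0 && topN < results.size()) {
            return new ArrayList<>(results.subList(0, topN));
        }
        return results;
    }
}
```

The original document index travels with each score, so clients can map results back to their input order after sorting.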

Docs:
- New docs/feature-investigation-ide-agent-backend.md (the deep-research report +
  an implementation-status preamble).
- README endpoints table + notes (rerank/infill, CORS, /v1-less aliases, response_
  format, the Copilot inline-completion limitation), CLAUDE.md server bullet,
  package-info, and TODO.md (DONE list + the deferred decisions: Ollama emulation,
  Anthropic /v1/messages + OpenAI /v1/responses shims, Continue native /completion,
  per-model FIM registry, /props).

Tests: +OaiRerankSupportTest (10), +rerank HTTP route, +response_format mapper test,
+--reranking/--mmproj CLI tests. Full server+json+arch suite green (138 tests);
javadoc + spotless clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JdLpWD8nedY7LwNnHefZLF
Adds an Ollama-compatible surface so Copilot's built-in Ollama provider and
Ollama-hardcoded tools can drive the local model, translating to/from the internal
OpenAI chat path (no second inference path):

- GET /api/version, GET /api/tags, POST /api/show — discovery; /api/show advertises
  capabilities (completion/tools/insert [+vision when --mmproj]) and context length,
  which Copilot reads to enable tools/vision and size prompts.
- POST /api/chat — non-streaming (single JSON) and streaming (NDJSON, one object per
  line, terminated by a "done":true line). Request options (num_predict→max_tokens,
  temperature/top_p/top_k/seed/stop) and `format` (json / schema → response_format)
  are mapped; Ollama tool-call arguments (object) ↔ OpenAI (JSON string) are converted
  both ways.
- ToolCallDeltaAccumulator: reusable helper that reconstructs whole tool calls from
  OpenAI streaming delta.tool_calls fragments (shared by the non-OpenAI shims that
  deliver tool calls whole). Streamed tool calls are emitted on the Ollama done line.
- OpenAiServerConfig.supportsVision (set by --mmproj) feeds the /api/show vision flag.

All pure translation lives in OllamaApiSupport + ToolCallDeltaAccumulator (model-free
unit-tested); the server handlers are thin and reuse the OpenAiBackend seam.
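
To illustrate the accumulator idea, here is a simplified sketch, not the PR's ToolCallDeltaAccumulator. It assumes each streamed fragment has already been parsed into (index, id, name, arguments piece), with null for fields absent from that fragment, as in OpenAI streaming where the first fragment of a call carries the id and name and later ones carry argument text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch only: rebuilds whole tool calls from streamed fragments keyed by index. */
final class ToolCallAccumulatorSketch {
    static final class ToolCall {
        String id;
        String name;
        final StringBuilder arguments = new StringBuilder();
    }

    // TreeMap keeps calls ordered by their streaming index.
    private final Map<Integer, ToolCall> byIndex = new TreeMap<>();

    /** Feeds one fragment; null means "field not present in this fragment". */
    void accept(int index, String id, String name, String argumentsFragment) {
        ToolCall call = byIndex.computeIfAbsent(index, k -> new ToolCall());
        if (id != null) {
            call.id = id;
        }
        if (name != null) {
            call.name = name;
        }
        if (argumentsFragment != null) {
            call.arguments.append(argumentsFragment); // arguments arrive as string pieces
        }
    }

    /** Returns the calls reconstructed so far, in index order. */
    List<ToolCall> complete() {
        return new ArrayList<>(byIndex.values());
    }
}
```

A shim that must deliver tool calls whole would call `accept` per fragment and read `complete()` once the stream ends, parsing the concatenated arguments string only at that point.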

Tests: +OllamaApiSupportTest (12), +ToolCallDeltaAccumulatorTest (3), +Ollama HTTP
route tests (version/tags/show/chat non-stream + NDJSON stream). Server+arch suite green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JdLpWD8nedY7LwNnHefZLF
…shims

Adds two more client-protocol surfaces over the internal OpenAI chat core (pure
translation, no second inference path; tool calls reconstructed from OpenAI
delta.tool_calls via the shared ToolCallDeltaAccumulator):

Anthropic Messages (POST /v1/messages):
- Request: system string/blocks → system message; content blocks (text / tool_use /
  tool_result) flattened to OpenAI messages (a user tool_result → a role:"tool"
  message); Anthropic tools (input_schema) → OpenAI function tools; tool_choice
  auto/any mapped.
- Non-streaming response: text + tool_use content blocks, stop_reason mapping
  (tool_calls→tool_use, length→max_tokens), usage.
- Streaming: the Anthropic SSE event sequence via AnthropicStreamTranslator
  (message_start → text content block start/delta/stop → tool_use blocks → message_
  delta → message_stop), with heartbeats.

OpenAI Responses (POST /v1/responses):
- Request: instructions → system; input string/array (message / function_call /
  function_call_output items) → OpenAI messages; flat function tools → nested.
- Non-streaming response: a `response` object whose output holds a message item
  (output_text) + one function_call item per tool call, with usage.
- Streaming: the Responses SSE event sequence via ResponsesStreamTranslator
  (response.created → output_item.added/content_part.added → output_text.delta* →
  *.done → function_call items → response.completed), with monotonic
  sequence_number, and heartbeats.

Both surfaces are reachable with and without the /v1 prefix and behind the CORS
filter. Docs (README endpoints table, CLAUDE.md server bullet, package-info, TODO)
updated; the two items are moved from "deferred" to done.
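
The stop_reason mapping in the Anthropic non-streaming response can be sketched as a small pure function. The two explicit cases come from the description above; treating every other finish_reason (including "stop") as "end_turn" is an assumption, and the class name is hypothetical.

```java
/** Sketch only: OpenAI finish_reason to Anthropic stop_reason. */
final class StopReasonSketch {
    private StopReasonSketch() {}

    static String toAnthropic(String finishReason) {
        if ("tool_calls".equals(finishReason)) {
            return "tool_use";
        }
        if ("length".equals(finishReason)) {
            return "max_tokens";
        }
        return "end_turn"; // assumption: default for "stop" and anything unrecognised
    }
}
```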

Tests: +AnthropicApiSupportTest (6), +AnthropicStreamTranslatorTest (4),
+ResponsesApiSupportTest (5), +ResponsesStreamTranslatorTest (4), + Anthropic/
Responses HTTP route tests (non-stream + stream). Full server+json+arch suite green
(251 tests); javadoc + spotless clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JdLpWD8nedY7LwNnHefZLF
Fills the remaining tractable IDE-backend endpoints (those not blocked on a native
streaming-completion path):

- GET /props (llama.cpp-native): reports default_generation_settings.n_ctx and a
  modalities block (vision from --mmproj), which autocomplete clients such as
  llama.vscode read to size their context window. OpenAiSseFormatter.propsJson;
  unauthenticated like /health.
- POST /api/generate (Ollama-native prompt completion / FIM): maps to the native
  /v1/completions handler, or to /infill when a `suffix` is present (FIM). Options
  (num_predict→max_tokens, temperature/top_p/top_k/seed/stop) are mapped. Non-streaming
  returns one JSON; stream:true returns NDJSON (a content line + a done line).
  Generation completes before emission — documented as a single content chunk, since
  there is no streaming raw-completion path (tracked in TODO as the shared blocker for
  streaming /v1/completions, token-streaming /api/generate, and Continue's native
  /completion).
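
A sketch of the /api/generate translation above, with the request modelled as a plain map. The names are hypothetical, and it assumes the non-renamed options keep their Ollama names on the native side.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch only: route selection and option mapping for Ollama /api/generate. */
final class OllamaGenerateSketch {
    // Assumption: these options are forwarded under the same name.
    private static final List<String> SAME_NAME =
            Arrays.asList("temperature", "top_p", "top_k", "seed", "stop");

    private OllamaGenerateSketch() {}

    /** A present `suffix` means fill-in-the-middle, so the request goes to /infill. */
    static String targetRoute(Map<String, Object> request) {
        return request.get("suffix") != null ? "/infill" : "/v1/completions";
    }

    /** Renames num_predict to max_tokens and drops options outside the known set. */
    static Map<String, Object> mapOptions(Map<String, Object> options) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : options.entrySet()) {
            if ("num_predict".equals(e.getKey())) {
                out.put("max_tokens", e.getValue());
            } else if (SAME_NAME.contains(e.getKey())) {
                out.put(e.getKey(), e.getValue());
            }
        }
        return out;
    }
}
```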

Docs (README, CLAUDE.md, package-info, TODO) updated; TODO now records the streaming
raw-completion JNI path as the one remaining blocker and trims the items it gates.

Tests: +/props + /api/generate translator unit tests (OllamaApiSupportTest,
OpenAiSseFormatterTest) and HTTP route tests (props, generate non-stream/FIM/stream).
Full server+json+arch suite green (261 tests); javadoc + spotless clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JdLpWD8nedY7LwNnHefZLF
Make TODO.md reflect the shipped state accurately and capture every remaining item:

- Intro now lists the complete surface: the OpenAI routes plus /props and the three
  alternative protocol surfaces (Ollama /api/*, Anthropic /v1/messages, OpenAI
  /v1/responses) — previously it stopped at /health.
- Add the one genuinely-open item that was missing: live end-to-end validation of the
  Ollama/Anthropic/Responses surfaces against a real model + real clients (today they
  are covered only by model-free unit + fake-backend HTTP tests; only the OpenAI chat
  path has a gated integration test).

Done items remain marked DONE; the deferred follow-ups (streaming raw-completion JNI
path and what it gates, incremental tool-call streaming, per-model FIM registry,
multi-model registry, Gemma 4 validation) are unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JdLpWD8nedY7LwNnHefZLF
…cols

CI's test-java-linux-x86_64 job already has the three ingredients in one place —
the Linux x86_64 native lib (downloaded artifact), the models (incl. Qwen3-0.6B),
and `mvn test` — so OpenAiCompatServerIntegrationTest already round-trips the OpenAI
chat path over a real socket. Extend that same gated fixture (same loaded Qwen3
model, self-skips when absent) to smoke the new surfaces end-to-end:

- Ollama /api/chat (non-stream + NDJSON stream) and discovery (/api/version, /api/tags,
  /api/show).
- Anthropic /v1/messages (non-stream + SSE event stream).
- OpenAI /v1/responses (non-stream + SSE event stream).
- /props.

Assertions are structural only (markers the translators always emit: Ollama
"done":true, Anthropic event: message_start/message_stop, Responses
event: response.created/response.completed, response object shape) so they are robust
to the tiny model's wording — matching the existing chat round-trip's approach.

Embeddings / rerank / infill round-trips still need their own server fixtures in the
matching mode (models already in CI); tracked in TODO.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JdLpWD8nedY7LwNnHefZLF
…_result-only turn

Low-hanging unit-test coverage on the recently-added protocol translators (all pure,
model-free), plus one correctness fix surfaced by a new test:

Fix:
- AnthropicApiSupport: a user turn carrying only tool_result blocks now emits exactly
  one role:"tool" message instead of that plus a spurious empty {"role":"user",
  "content":""}. Guarded with a hadToolResult flag; assistant tool-call turns still
  carry null content.

New / extended tests:
- OpenAiServerConfigTest (new): builder defaults, isAuthenticationEnabled (null/empty/
  set), CORS + vision knobs, and the security contract that toString() never leaks the
  API key (only authEnabled boolean).
- OpenAiServerCli: --mmproj now asserted to flip toServerConfig().supportsVision.
- AnthropicApiSupport: system-as-blocks concatenation + stop_sequences mapping,
  tool_choice "any" -> "required", and the fixed tool_result-only branch.
- OllamaApiSupport: format-as-schema -> response_format json_schema; options.stop
  forwarding.
- ResponsesApiSupport: no-instructions (no system message), assistant message item,
  and non-function tools dropped.
- OpenAiCompatServerHttpTest: GET on the new POST routes (/v1/rerank, /v1/messages,
  /v1/responses, /api/chat, /api/generate) returns 405 (shared requirePostJson preamble).
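
The toString() security contract tested by OpenAiServerConfigTest can be illustrated with a stripped-down config type. `ServerConfigSketch` is a hypothetical stand-in, not the real OpenAiServerConfig; only the behaviour (an authEnabled boolean instead of the key) follows the description above.

```java
/** Sketch only: a config whose toString() reports auth state, never the key. */
final class ServerConfigSketch {
    private final String apiKey;

    ServerConfigSketch(String apiKey) {
        this.apiKey = apiKey;
    }

    /** Authentication is on only for a non-null, non-empty key. */
    boolean isAuthenticationEnabled() {
        return apiKey != null && !apiKey.isEmpty();
    }

    @Override
    public String toString() {
        // Deliberately omits apiKey so logs and debuggers cannot leak it.
        return "ServerConfigSketch(authEnabled=" + isAuthenticationEnabled() + ")";
    }
}
```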

Server + arch suite green (150 model-free tests); javadoc + spotless clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JdLpWD8nedY7LwNnHefZLF
…ompletion/infill/generate

Completes the live end-to-end coverage of the IDE-backend surfaces. Each fixture boots
a real server over a real socket in the matching model mode, reuses a model CI already
downloads, self-skips when absent, and asserts structural shapes only:

- OpenAiServerEmbeddingsIntegrationTest (CodeLlama-7B + enableEmbedding): POST
  /v1/embeddings returns an OpenAI {object:list, data:[{object:embedding, embedding:[…]}]}
  shape; also covers the bare /embeddings alias.
- OpenAiServerRerankIntegrationTest (jina-reranker + enableReranking): POST /v1/rerank
  returns sorted {index, relevance_score} results capped by top_n, with the `data` alias.
- OpenAiServerCompletionIntegrationTest (CodeLlama-7B): POST /v1/completions, /infill, and
  Ollama /api/generate (plain + FIM via `suffix`) — CodeLlama is FIM-capable per
  LlamaModelTest#testGenerateInfill.

Also: add TestConstants.RERANKING_MODEL_PATH and route RerankingModelTest through it
(removes the duplicated literal). Used Java-8-safe idioms throughout.

These run in the same CI job that already round-trips the OpenAI chat path, so the
Ollama/Anthropic/Responses/embeddings/rerank/completion surfaces are now all validated
end-to-end against real models; only manual editor-client validation remains (TODO).

Server + arch suite green (integration fixtures self-skip without models locally);
javadoc + spotless clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JdLpWD8nedY7LwNnHefZLF
…nsolidated server

The SonarQube "Build and analyze" job runs spotbugs at effort=Max/threshold=Low with
fb-contrib + findsecbugs, which flagged 30 Low-priority findings across the new
net.ladenthin.llama.server.* classes. Resolved as follows.

Fixed in code:
- Add Lombok @ToString to the stateful infra classes (AnthropicStreamTranslator,
  ResponsesStreamTranslator, ToolCallDeltaAccumulator, LlamaModelBackend,
  OpenAiCompatServer), the project's established remedy for
  IMC_IMMATURE_CLASS_NO_TOSTRING (see Java8CompatibilityHelper). @ToString on the
  server is leak-safe: it renders the config via OpenAiServerConfig.toString(), which
  already redacts the API key.
- Add @EqualsAndHashCode to the immutable OpenAiServerConfig value type
  (IMC_IMMATURE_CLASS_NO_EQUALS).
- printReady(): println('[')/println(']') instead of length-1 strings (UCPM).

Suppressed (documented in spotbugs-exclude.xml) as by-design / false positive on
protocol-infrastructure code, mirroring the existing server PATH_TRAVERSAL/CRLF and
OpenAiServerCli.parse entries:
- IMPROPER_UNICODE: equalsIgnoreCase on ASCII HTTP method tokens (RFC 7230/7231).
- LEST: the UncheckedIOException -> IOException unwrap rethrows the original cause.
- WEM: input-validation precondition guards (same shape as ChatMessage.requireNonNull).
- MDM: main() blocks forever to keep the JVM alive for the daemon HTTP threads.
- NOS: per-request stream write lock passed as a parameter (not a shared field).
- MRC: sseDone()/heartbeat() are self-documenting SSE protocol-token accessors.
- PRMC: ResponsesApiSupport.dataObject() is a fresh-node factory, not cacheable.
- IMC_NO_EQUALS on the identity-managed ToolCallDeltaAccumulator.

Verified locally: mvn spotbugs:check -> BUILD SUCCESS (0 bugs); spotless clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JdLpWD8nedY7LwNnHefZLF
…p opaque-field toString

Two static-analysis findings on the new server code, both surfaced once the spotbugs fix let
the Maven build reach the sonar:sonar step for the first time:

1. SonarCloud quality gate (E Reliability on New Code, S2095 BLOCKER): the four streaming
   handlers (streamChat / streamOllamaChat / streamAnthropic / streamResponses) opened
   os = exchange.getResponseBody() and closed it via the closeQuietly() helper in finally.
   Sonar's S2095 does not trace closes through a helper, so it saw a possibly-unclosed
   resource. Fixed by closing os directly in the finally, under the per-stream write lock so
   the close still never races an in-flight heartbeat write; the closeQuietly helper is
   removed. Also moved heartbeatExecutor.scheduleAtFixedRate inside the try so a scheduling
   failure can no longer leak the stream (a real, if rare, pre-existing leak).

2. CodeQL "Use of default toString()" on LlamaModelBackend (non-blocking alert): the @ToString
   added in the previous commit rendered fields whose classes only inherit Object.toString
   (the request mapper and native model handle; HttpServer and the CORS Filter on the server).
   Dropped @ToString from LlamaModelBackend and OpenAiCompatServer (opaque-resource/service
   classes where a generated toString only emits identity hashes) and suppressed
   IMC_IMMATURE_CLASS_NO_TOSTRING for them with rationale. The translator/accumulator classes
   keep @ToString (their fields render meaningful state and are CodeQL-clean).

Verified locally: spotbugs:check -> 0 bugs; 67 server unit tests pass (incl. the 32
OpenAiCompatServerHttpTest streaming-path tests over real sockets); javadoc + spotless clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JdLpWD8nedY7LwNnHefZLF
…rCloud S2445)

The SonarCloud quality gate kept failing with "E Reliability Rating on New Code" even after the
previous commit reworked the streaming close, and the gate was unchanged across that push — the
signal that the Blocker reliability bug was the synchronization TARGET, not the close. The
streaming write helpers synchronized on a writeLock passed as a method parameter (and, after the
last commit, on a method-local Object), which SonarCloud flags as java:S2445 "Blocks should be
synchronized on read-only fields" — a Blocker reliability bug. (fb-contrib's NOS flagged the same
code; that suppression masked it from spotbugs but Sonar evaluates it independently.)

Fix: introduce a small per-request AutoCloseable ResponseStream that owns the response
OutputStream and a private final lock, and serializes writeStrict / writeQuietly / close on that
owned lock. The four streaming handlers drive it via try-with-resources, so the stream is closed
(under the lock, after the heartbeat is cancelled) on every path — which also satisfies S2095 and
lets the now-stale NOS suppression be removed. Per-request locking is preserved (each request has
its own lock), so independent concurrent streams never serialize against each other.
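
A minimal sketch of the ResponseStream shape described above: one object owns both the OutputStream and a private final lock, and every write and the close go through that lock. The method names follow the commit message; the `closed` flag and the boolean return of `writeQuietly` are assumptions.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

/** Sketch only: per-request stream wrapper that serializes writes and close. */
final class ResponseStreamSketch implements AutoCloseable {
    private final Object lock = new Object(); // private final field, never a parameter
    private final OutputStream out;
    private boolean closed; // guarded by lock

    ResponseStreamSketch(OutputStream out) {
        this.out = out;
    }

    /** Writes and flushes, propagating I/O failures to the caller. */
    void writeStrict(String text) throws IOException {
        synchronized (lock) {
            if (closed) {
                throw new IOException("stream closed");
            }
            out.write(text.getBytes(StandardCharsets.UTF_8));
            out.flush();
        }
    }

    /** Best-effort write (e.g. heartbeats): reports failure instead of throwing. */
    boolean writeQuietly(String text) {
        try {
            writeStrict(text);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    @Override
    public void close() throws IOException {
        synchronized (lock) {
            if (closed) {
                return; // idempotent close
            }
            closed = true;
            out.close();
        }
    }
}
```

Because the wrapper is AutoCloseable, a handler can use `try (ResponseStreamSketch stream = new ResponseStreamSketch(exchange.getResponseBody())) { ... }`, and a heartbeat that fires after the close fails quietly rather than racing it.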

Verified locally: spotbugs:check -> 0 bugs; 48 server unit tests pass incl. the 32
OpenAiCompatServerHttpTest streaming-path tests over real sockets; spotless clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JdLpWD8nedY7LwNnHefZLF
…bjllama

First step toward driving the OpenAI-compatible server natively from JNI,
shipped inside libjllama rather than as a standalone llama-server executable
(a JNI .so/.dll/.dylib loads anywhere a JVM runs; a separate binary does not,
which is the whole point of preferring the JNI path here). This commit only
makes the HTTP layer build and link — no JNI route wiring yet.

What changed (CMakeLists.txt):
- Compile tools/server/server-http.cpp (the upstream server_http_context HTTP
  transport) and vendor/cpp-httplib/httplib.cpp directly into jllama, on all
  platforms (the getifaddrs API-24 gate cpp-httplib needs on Android is already
  satisfied by the existing __ANDROID_UNAVAILABLE_SYMBOLS_ARE_WEAK__ define).
- <cpp-httplib/httplib.h> already resolves via llama-common's vendor/ include
  dir, whose bundled nlohmann/json is the same 3.12.0 as our FetchContent copy,
  so nothing is shadowed and no extra include dir is required for it.
- Mirror upstream's cpp-httplib tuning defines (payload/URI/backlog limits,
  TCP_NODELAY) on jllama so httplib.cpp and the server-http.cpp that includes
  httplib.h agree on the inline behaviour those macros control.
- Silence httplib.cpp warnings (-w / /w), matching upstream's own target.
- Link ws2_32 on MinGW (MSVC auto-links it via a pragma in httplib.h).
- No SSL: CPPHTTPLIB_OPENSSL_SUPPORT is left undefined (plain HTTP for now;
  bind localhost or front with a TLS proxy).

WebUI stub (src/main/cpp/webui_stub/ui.h):
- server-http.cpp does #include "ui.h" — the asset table tools/ui (llama-ui)
  normally GENERATES via the llama-ui-embed host tool. We do not ship the Svelte
  WebUI (it needs npm or a prebuilt-asset download), so this header supplies the
  exact "empty asset table" interface embed.cpp emits for n_assets == 0: the
  llama_ui_asset struct plus llama_ui_find_asset / llama_ui_use_gzip /
  llama_ui_get_assets. LLAMA_UI_HAS_ASSETS is intentionally left undefined, so
  every static-asset-serving block in server-http.cpp compiles out; the single
  unguarded use iterates the (empty) asset list. Header-only (.h) so it is
  outside the clang-format glob, which only covers *.cpp/*.hpp.

server.cpp (standalone main() + route wiring) stays excluded — wiring those
routes to a JNI entry point is the next step.

Verified locally (Linux x86_64):
- cmake --build --target jllama -> [100%] Built target jllama (clean).
- libjllama.so contains server_http_context::init/start/stop (T) and ~1.8k
  httplib symbols, with zero undefined server-http/httplib symbols.
- NativeLibraryLoadSmokeTest: Tests run: 1, Failures: 0, Skipped: 0 (the larger
  lib still loads and JNI_OnLoad resolves every referenced Java class).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JdLpWD8nedY7LwNnHefZLF
…uilds

Implements the build-once-share-artifact approach for embedding the llama.cpp Svelte
WebUI into libjllama, so the in-process server (server-http.cpp) serves the real UI
instead of the empty-asset stub. The repo commits no build outputs, so the WebUI is
produced per-pipeline and never checked in (same policy as the native libs).

CI (.github/workflows/publish.yml):
- New build-webui job (ubuntu; the only job that runs npm): resolves the pinned
  b<nnnn> tag from CMakeLists.txt GIT_TAG, sparse-checks-out ggml-org/llama.cpp@<tag>
  tools/ui, runs the upstream Svelte build (npm ci && npm run build), gzips dist/
  (LLAMA_UI_GZIP parity), builds the self-contained llama-ui-embed host tool (plain
  C++17, no npm) and runs it to produce the platform-independent webui-generated/
  ui.cpp + ui.h, uploaded as the webui-generated artifact.
- All 10 release-artifact build jobs now use needs:[startgate, build-webui] and
  download the artifact into webui-generated/ before building. npm never runs in the
  dockcross cross-compilers (no node) or per-platform — only once, in one job.

CMake (CMakeLists.txt "WebUI assets" block):
- When webui-generated/ui.cpp + ui.h are present, compile ui.cpp in and add its dir
  to the include path; the generated ui.h #defines LLAMA_UI_HAS_ASSETS, activating
  server-http.cpp's static-asset routes (compiled out under the stub). When absent,
  fall back to the empty-asset stub webui_stub/ui.h so local builds and any job
  without the artifact still build and run (no embedded UI).

The WebUI version auto-follows GIT_TAG, so a llama.cpp bump needs no extra step.
webui-generated/ is git-ignored; CLAUDE.md documents the pipeline + a local recipe.

Verified locally (Linux x86_64) with the real llama-ui-embed tool (no npm) on a
synthetic 9-asset dist: generated ui.cpp/ui.h carry LLAMA_UI_HAS_ASSETS + use_gzip
(9 assets); jllama rebuilds with server-http.cpp's asset routes compiled in, ui.cpp
compiled, libjllama.so linked (llama_ui_get_assets/find_asset defined, 0 undefined);
NativeLibraryLoadSmokeTest passes; removing webui-generated/ -> CMake reports the stub
fallback and jllama still builds. publish.yml parses (pyyaml); exactly the 10 native
build jobs gate on build-webui.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JdLpWD8nedY7LwNnHefZLF
…/npm

The SonarCloud quality gate failed on exactly two conditions (Reliability E and
Security C on new code), driven by 2 Blocker bugs + 25 Major security findings. (The
~45 other annotations are maintainability code smells that do not affect the gate.)

Reliability (java:S2095, OpenAiCompatServer.main L888/L889): Sonar did not trace that
the LlamaModel and OpenAiCompatServer are closed by the shutdown hook, so it flagged
them as never closed. Refactor main() to hold both in a try-with-resources; a two-latch
shutdown keeps termination graceful and race-free (the hook signals stopRequested then
waits on cleanedUp, so the JVM — which blocks until shutdown hooks return — does not
halt until the close has run). This also closes the model if server startup throws,
which the previous code did not.
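
The two-latch shutdown can be sketched as follows. This is an illustration under assumptions, not the PR's main(): the class and method names are hypothetical, and the model and server are reduced to plain AutoCloseable parameters.

```java
import java.util.concurrent.CountDownLatch;

/** Sketch only: graceful shutdown coordinated by two latches. */
final class TwoLatchShutdownSketch {
    final CountDownLatch stopRequested = new CountDownLatch(1);
    final CountDownLatch cleanedUp = new CountDownLatch(1);

    /** Body of main(): owns both resources until a stop is requested. */
    void runUntilStopped(AutoCloseable model, AutoCloseable server) throws Exception {
        try (AutoCloseable m = model; AutoCloseable s = server) {
            stopRequested.await(); // park the main thread while the server runs
        } finally {
            // Runs after both resources are closed (server first, then model).
            cleanedUp.countDown();
        }
    }

    /** Body of the shutdown hook: request the stop, then wait for the close. */
    void onShutdown() throws InterruptedException {
        stopRequested.countDown();
        cleanedUp.await();
    }
}
```

The hook would be registered with `Runtime.getRuntime().addShutdownHook(new Thread(...))` calling `onShutdown()`. Since the JVM waits for shutdown hooks to return, it cannot halt before `cleanedUp` is counted down.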

Security (.github/workflows/publish.yml):
- npm ci -> npm ci --ignore-scripts in the build-webui job, so dependency lifecycle
  scripts do not run during install (the WebUI build still runs via `npm run build`).
- Every curl model-download now passes --proto =https --proto-redir =https, so neither
  the URL nor any redirect can downgrade to cleartext HTTP (the URLs are already https;
  this enforces it). 31 invocations hardened.

These are exactly the 2 reliability + 25 security issues SonarCloud listed, so both
ratings should return to A. Verified: mvn spotless:apply + compile clean; publish.yml
parses.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JdLpWD8nedY7LwNnHefZLF
…ble surfaces

#244 made the chat core honor parallel_tool_calls, but only the OpenAI
/v1/chat/completions surface forwarded it; the alternative protocol surfaces (which
translate into that same chat core) silently dropped the equivalent flag. Close the gap:

- Anthropic /v1/messages (AnthropicApiSupport.toOpenAiChatRequest): map
  tool_choice.disable_parallel_tool_use=true -> parallel_tool_calls=false (default stays
  parallel when unset/false).
- OpenAI Responses /v1/responses (ResponsesApiSupport.toOpenAiChatRequest): forward
  parallel_tool_calls, and also forward tool_choice (string form), which was being dropped
  entirely — both now reach the shared OpenAiRequestMapper.
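
A minimal sketch of the two mappings above, assuming the request JSON has already been parsed into plain java.util.Map values. The class and method names (ParallelToolCallsMappingSketch, fromAnthropic, fromResponses) are hypothetical and only illustrate the rule; the real translators are AnthropicApiSupport.toOpenAiChatRequest and ResponsesApiSupport.toOpenAiChatRequest, whose signatures are not shown here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ParallelToolCallsMappingSketch {
    // Anthropic -> chat core: only an explicit disable_parallel_tool_use=true emits
    // parallel_tool_calls=false; unset or false leaves the parallel default untouched.
    static Map<String, Object> fromAnthropic(Map<String, Object> request) {
        Map<String, Object> chat = new LinkedHashMap<>();
        Object toolChoice = request.get("tool_choice");
        if (toolChoice instanceof Map<?, ?>) {
            Object disable = ((Map<?, ?>) toolChoice).get("disable_parallel_tool_use");
            if (Boolean.TRUE.equals(disable)) {
                chat.put("parallel_tool_calls", Boolean.FALSE);
            }
        }
        return chat;
    }

    // Responses -> chat core: forward parallel_tool_calls and the string form of
    // tool_choice unchanged; absent keys stay absent.
    static Map<String, Object> fromResponses(Map<String, Object> request) {
        Map<String, Object> chat = new LinkedHashMap<>();
        Object parallel = request.get("parallel_tool_calls");
        if (parallel instanceof Boolean) {
            chat.put("parallel_tool_calls", parallel);
        }
        Object toolChoice = request.get("tool_choice");
        if (toolChoice instanceof String) {
            chat.put("tool_choice", toolChoice);
        }
        return chat;
    }
}
```

The sketch copies only the two fields discussed here; a real translator would also carry messages, tools and the remaining request fields into the chat request.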

Tests:
- AnthropicApiSupportTest / ResponsesApiSupportTest: unit-cover the new mappings (set, and
  omitted-when-absent).
- OpenAiServerToolCallingIntegrationTest (new): real-model end-to-end over HTTP using the
  Qwen2.5-1.5B tool model #244 wired into CI. tool_choice="required" forces a call, so it
  deterministically asserts the server returns a well-formed tool_calls array (arguments as a
  JSON string, llama.cpp #20198) and that parallel_tool_calls=false travels
  HTTP -> mapper -> native intact. Self-skips when the model is absent.

Verified locally: spotless, compile, spotbugs clean; model-free translator tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JdLpWD8nedY7LwNnHefZLF
claude added 2 commits June 20, 2026 10:42
…mpiler cache

Two complementary fixes for the macOS build, behind a new `use_cache`
workflow_dispatch input (default true). Phase 1 = macOS only.

BUILD_JOBS knob (previously investigated, never landed): build.sh now honors
$BUILD_JOBS for the cmake -j level (default = all cores; portable nproc/sysctl
detection), and the 3 macOS build jobs set BUILD_JOBS=2. GitHub's ~7 GB macOS arm64
runners run out of memory under -j$(nproc) when the 16.6k-line httplib.cpp is compiled
concurrently with the model TUs; the runner is then killed with SIGTERM (exit code 143,
"received a shutdown signal"), which is not a real timeout. Capping concurrent compiles
bounds peak memory.

sccache -> Depot Cache (WebDAV): build.sh routes the compiler through sccache
(-DCMAKE_*_COMPILER_LAUNCHER) only when USE_CACHE=true AND sccache + a cache token are
present, then prints `sccache --show-stats`. The 3 macOS jobs brew-install sccache and
set SCCACHE_WEBDAV_ENDPOINT=https://cache.depot.dev +
SCCACHE_WEBDAV_TOKEN=${{ secrets.DEPOT_TOKEN }}. Because llama.cpp is pinned, the ~280
upstream object files are content-identical every run, so a warm cache recompiles only
changed files — staying -O3, bit-identical and release-safe. Depot's cache is shared
across all branches, so every branch builds incrementally (and warm builds also cut the
macOS memory pressure further).

Safety: the cache stays inert until the DEPOT_TOKEN secret exists, and on fork PRs (where
secrets are hidden); such builds just compile normally. The install step is
continue-on-error, and use_cache=false forces a clean from-scratch build. build.sh gating
verified locally across all four cases (warm / Linux-untouched / no-token / explicit-off).

Phase 2 (later): dockcross Linux/Android/CUDA (needs the token + sccache binary passed
into the container), Windows, and the Linux-host test-cpp job.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JdLpWD8nedY7LwNnHefZLF
… CLAUDE.md

- README: a small "Build cache:" badge group crediting Depot (sccache → Depot Cache),
  matching the existing tool-badge style.
- CLAUDE.md: a "CI build cache & parallelism (sccache + Depot)" section for maintainers —
  the BUILD_JOBS knob (macOS -j2 / OOM rationale), the sccache WebDAV → Depot wiring
  (SCCACHE_WEBDAV_ENDPOINT/TOKEN, the DEPOT_TOKEN secret), content-addressed / pinned-tag
  hit behavior, release-safety, the use_cache flag + fork-PR/no-token inertness, and the
  phase-1 (macOS) vs phase-2 (dockcross/Windows/Linux-host) rollout.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JdLpWD8nedY7LwNnHefZLF
…st builds

Extends the shared sccache -> Depot compiler cache (phase 1 was macOS-only) to the
time-consuming cross-compiles so they build incrementally on a warm cache too.

- build.sh: when caching is requested but no sccache is on PATH (the dockcross
  manylinux/Android containers and Linux hosts don't ship it; macOS uses brew), fetch the
  static musl sccache v0.8.2 binary (validated: runs + has WebDAV support). Best-effort
  and inert-safe — any fetch/network failure leaves sccache absent and the build proceeds
  uncached.
- publish.yml: the 4 dockcross jobs (manylinux_2_28 CUDA, manylinux2014, Linux aarch64,
  Android aarch64) and the Linux-host C++ test job set USE_CACHE +
  SCCACHE_WEBDAV_ENDPOINT/TOKEN. The dockcross jobs also set
  DOCKCROSS_ARGS="-e SCCACHE_WEBDAV_ENDPOINT -e SCCACHE_WEBDAV_TOKEN -e USE_CACHE" so the
  wrapper forwards them into the container ($FINAL_ARGS, sourced from DOCKCROSS_ARGS, is
  injected into docker run).

CUDA note: only the C/C++ launcher is set, so the gcc-compiled bulk (134 model TUs + ggml
+ httplib) caches; the nvcc .cu kernels still compile normally (sccache's nvcc support is
limited) — a large but not total speedup there.

Inert-safe as before: no token / fork PRs / use_cache=false -> normal uncached build;
artifacts stay -O3 and bit-identical. Verified locally: build.sh gating across no-token /
present / fetch-fail / disabled; sccache v0.8.2 download; YAML parses; DOCKCROSS_ARGS only
on the 4 container jobs.

Phase 2b TODO: the Android-OpenCL job (separate build_opencl_android.sh) and the Windows
jobs (build.bat + MSVC via sccache-action).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JdLpWD8nedY7LwNnHefZLF
…ity gate)

The phase-2 sccache fetch in build.sh used `curl -fsSL` (which follows redirects via -L)
without --proto =https --proto-redir =https, tripping the same "Not enforcing HTTPS /
redirections to insecure websites" Major hotspot the model-download curls were already
hardened against — which dropped the New-Code Security Rating to C and failed the gate.

Add the proto flags so neither the URL nor the GitHub release redirect can downgrade to
cleartext. Verified the download still succeeds through the
github.com -> objects.githubusercontent.com (HTTPS) redirect.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JdLpWD8nedY7LwNnHefZLF
…crashed in-container)

The phase-2 cross-compile caching broke the dockcross builds: inside the manylinux
container the fetched sccache panicked while wrapping the cross-compiler (log: a Rust
backtrace + "Run with SCCACHE_LOG=debug", failing ggml.c.o), and because sccache is the
compiler launcher, its crash fails the whole build. The build.sh "inert-safe" guard only
covers sccache being *absent*, not present-but-crashing.

Remove the sccache env from the 4 dockcross jobs (manylinux_2_28 CUDA, manylinux2014,
Linux aarch64, Android aarch64) and the Linux-host C++ test job. With no token reaching
those jobs, build.sh skips the fetch and they compile normally (uncached, green) again.

macOS caching (phase 1) is unaffected and stays — it uses native clang + brew sccache and
is proven working/fast. The build.sh fetch logic remains but is dormant without a token.

Dockcross caching is deferred: it needs an sccache-in-container health check (probe-compile
before trusting the launcher) and SCCACHE_LOG=debug diagnosis of the cross-compiler panic.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JdLpWD8nedY7LwNnHefZLF
@bernardladenthin bernardladenthin merged commit 0b54f5f into main Jun 20, 2026
33 of 34 checks passed
@bernardladenthin bernardladenthin deleted the claude/clever-clarke-6a6vvg branch June 20, 2026 11:53